# CRM analytics || RFM analysis || LTV prediction

Table of Contents:


Introduction

CRM analytics is the process of analyzing customer data to enhance relationships, boost retention, and stimulate sales growth. It aids in understanding customer behavior, personalizing products and services, forecasting sales, and improving customer service.

In this study, we will conduct a detailed analysis of customer data from an international online store. Our objective is to identify key metrics that will help us gain deeper insights into our customers' actions and the effectiveness of our marketing campaigns. Consequently, we will segment customers and forecast their performance for the upcoming year.

We hope that this analysis will help us improve our work and increase our profits. Let's get started.

Importing libraries, uploading datasets


HOME || Introduction

Part I - Exploratory Data Analysis & Probability

Table of Contents:

Description:

1.1 | Description of the dataset Orders

Below is the description of the data, there are a total of 5 columns and 541909 rows:

Data columns Purpose Types
order_id Unique purchase number object
vendor_code Product code object
quantity Quantity of goods int64
invoice_date Purchase invoice date datetime64[ns]
unit_price Cost of 1 unit of goods, in $. float64

HOME || Part I

1.2 | Description of the dataset Customers

Below is the description of the data, there are a total of 5 columns and 25900 rows:

Data columns Purpose Types
order_id Unique purchase number object
customer_id Unique client number object
country Country from which the order was made object
channel Channel where the customer came from: organic - organic search engine results or direct access to the site, fb_ads - Facebook Ads, g_ads_s - Advertising on Google search results, g_ads_b - Google banner advertising, remarketing, other_banner - Other banner advertising on partner sites object
cost Cost of advertising, transition, in $. float64

HOME || Part I

1.3 | Summary dataset

Let's combine the data into one dataset


HOME || Part I

1.4 | Duplicates

Let's get duplicate rows - 5270

Removing duplicates

Saving a basic copy of the dataset


HOME || Part I

1.5 | Time period in the dataset


HOME || Part I

1.6 | Number of clients and orders by country

  • The countries with the highest number of unique customers and orders are the USA and the United Kingdom. This may indicate that these markets are primary for your business
  • In some countries, such as Lithuania and EIRE, the number of unique customers is relatively small, but the average number of orders per customer is higher. This may indicate that customers from these countries make more purchases on average
  • It's important to note that the number of orders is not always directly proportional to the number of customers. This can be related to various factors such as repeat purchases, order size, etc.
  • The average number of orders per customer can provide additional insights into customer behavior. For example, a high average might indicate a high level of customer loyalty, while a low average might suggest that many customers make only one purchase
  • For a deeper analysis, you can consider other metrics that can help better understand customer behavior, such as the total revenue per customer, the average order value, or the frequency of repeat purchases

HOME || Part I

1.7 | Canceled and erroneous orders

Let's look at the data where the 'quantity' is less than or equal to 0.

Let's look at the data where the 'unit_price' is less than or equal to 0.

  1. Definitely, we need to remove from the data orders in which quantity is less than 0. Also remove data where unite_price is less than 0
  2. What to do with the data (2510 rows), where unite_price equals = 0?

Let's take into account that products with unite_price = 0 are free products (promotional).

Сonclusions:

  • Countries with the highest number of cancelled or returned orders may indicate issues related to delivery, product quality, or customer service levels in these regions.
  • The percentage of losses can serve as an indicator of business process efficiency. A high percentage may indicate a need for improvements in order management processes, while a low percentage may indicate a high level of customer satisfaction.
  • For a deeper analysis, other metrics that can help better understand customer behavior, such as total revenue per customer, average order value, or frequency of repeat purchases, can be considered.

HOME || Part I

1.8 | Processing canceled orders

Let's display individual orders to compare.

What conclusions can be drawn?

Based on the date and channel of customer acquisition, paired orders can be divided into 3 groups:

  1. An order is placed and then cancelled.
  2. The order was canceled and then re-placed.
  3. The order is made, then canceled, then made again.

There may be anomalies when receiving data. It is recommended to highlight data with canceled orders and their pairs for further research.

Let's remove group 1: An order is made and then cancelled.

As well as all other orders [startswith "C"] and having ['quantity'<=0] and ['unit_price'<0].

We leave only those rows where all orders starting with 'C' have a later date than all other orders in this row.

Remaining canceled orders after filtering

Let's delete the data where 'quantity'<=0 or 'unit_price'<0


HOME || Part I

1.9 | Column "Revenue"

Create an income column called "Revenue". Next, we group the data by orders.

It will be interesting to see the maximum values. Perhaps these are errors.

The results are questionable. These orders must be looked at separately.


HOME || Part I

1.10 | Grouping data by Orders

Grouping data by orders


HOME || Part I

1.11 | Created a Customer Profile

Grouping data by customers

Create a DataFrame with dummy variables for each channel and customer profile

Created a Customer Profile

Checksum verification between datasets


HOME || Part I

1.12 | Working with emissions. Maximum Value Analysis

You need to look at the data where revenue is 0. And also orders with maximum revenue

Let's display data outliers

  • Clients and orders with zero revenue, but funds were spent on attraction. Such data can be deleted.
  • There are highly profitable orders and customers. If we take data on orders, then 1% of orders makes up 18,1% of total revenue. And if we take data on customers, then 1% of customers brought in 27,1% of the total income. Such clients cannot be excluded, they must be considered individually.

HOME || Part I

Part II - Analysis of General Indicators

Table of Contents:

Description:

2.1 | Indicators by Сountry

What countries do visitors come from? Which countries provide the most paying users?

From the data, we can draw the following conclusions:

  • The USA is the country with the highest number of paying customers (1530), accounting for 29.9% of the total number of paying customers. This country also brings in the highest revenue - 2175922.781, which is 25.49% of the total revenue.
  • The second-largest market is the United Kingdom with 1261 paying customers (24.64% of the total) and a revenue of 1996822.431 (23.39% of the total revenue).
  • Germany ranks third with 498 paying customers (9.73% of the total) and a revenue of 1021030.100 (11.96% of the total revenue).
  • Despite the Netherlands having only 1% of the total number of paying customers (51 customers), they contribute 3.88% of the total revenue, indicating a high average order value in this country.
  • Saudi Arabia has the lowest performance, accounting for just 0.55% of total paying customers (28 customers), contributing 0.24% of total revenue, which also indicates a relatively low average order value.

Preliminarily Based on the relative quality of customers, we can draw a conclusion about promising sales regions.


HOME || Part II

2.2 | Indicators by customer Acquisition Channels

What advertising channels were used to attract users? Which channels bring in the most paying users?

Results:

  1. Organic traffic turns out to be the most effective customer acquisition channel, attracting the largest number of paying customers (2,777, representing 27.6% of total paying customers) and generating the highest revenue - 2,705,519,191, representing 31.7% from total income. It's important to note that no money was spent on acquiring these customers.
  2. Facebook Ads (fb_ads) are the second most effective, attracting 2,022 paying customers (20.1% of total paying customers) and generating 1,752,351,471 in revenue (20.5% of total revenue). 2965.15 was spent on attracting these clients. Average acquisition cost per purchase - 0,84.
  3. Google Ads Media (g_ads_b) ranks third, attracting 1,839 paying customers (18.3% of total paying customers) and generating 1,587,409,852 in revenue (18.6% of total revenue). 261.04 was spent on attracting these clients. Great, almost free. Average acquisition cost per purchase - 0,086.
  4. Other banners (other_banner) attract 1779 paying customers (17.7% of total paying customers) and generate 1296409.400 in income (15.2% of total income). 1332.60 was spent on attracting these clients. Average acquisition cost per purchase - 0,46.
  5. Advertising in Google Ads in search results (g_ads_s) was the least effective, attracting 1635 paying customers (16.3% of total paying customers) and generating 1195413.350 in revenue (14.0% of total revenue). The largest amount of funds was spent on attracting these clients - 4694.11. Average acquisition cost per purchase - 1,82.

In general, we can say that Organic traffic is the most effective channel for attracting customers, as it generates the most revenue and does not require any costs.

However, advertising in Google Ads Search in search results turned out to be the least effective, despite the highest costs.


HOME || Part II

Part III - Analysis of the Distribution of Indicators over Time

Table of Contents:

Description:

3.1 | Distribution of Indicators for each month

Let's display general indicators by month

  • Advertising costs (adv_cost) increase over time. At the beginning of 2022 January, the costs were 476.54, and by the end of the year, they increased to 1184.07. It is particularly noteworthy that there was a significant increase in costs starting from September, which may be associated with an increase in advertising campaigns in anticipation of the holiday season.
  • The highest costs were in November 2022 (1500.41). This could be associated with an increase in advertising campaigns in anticipation of the holiday season.
  • The lowest costs were in January, February, and April 2022, amounting to 476.54, 495.94, and 561.09 respectively. This could be associated with fewer advertising campaigns after the New Year holidays and during the spring break period. However, despite the low advertising costs in January, the ARPU (502,06) in this month was above average, indicating the high effectiveness of advertising campaigns during this period.
  • Overall, advertising costs vary from month to month, which could be associated with various factors such as seasonality, special advertising campaigns, and changes in advertising strategy.
  • Total revenue (total_revenue) also increases over time, which may be associated with an increase in advertising costs. It is particularly noteworthy that there was a significant increase in revenue starting from September (1323611), which coincides with the increase in advertising costs.
  • Average Revenue Per User (ARPU) and Average Revenue Per Paying User (ARPPU) also increase over time, indicating that the quality of attracted users is improving. The lowest ARPU and ARPPU values are observed in February and April, which coincide with the periods of the lowest advertising costs.
  • The number of unique users (unique_customers) and the total number of orders (total_orders) also increase over time, indicating successful attraction of new users and increased activity of existing users. It is particularly noteworthy that there was a significant increase in these indicators starting from September, which coincides with the increase in advertising costs.
  • The best months for all indicators are November and December. ARPPU was $755 and $814, respectively.

HOME || Part III

3.2 | Revenue & ARPPU over time

Group by date and feature if provided, calculate total revenue and number of unique customers

3.2.1 | Revenue and ARPPU over time for All Data

  • In general, Revenue and ARPPU fell after January, and then began to grow over time.
  • At the same time, drawdowns are visible in some weeks of the month Revenue and ARPPU grow over time

3.2.2 | Revenue and ARPPU over time by Acquisition channels

  • The dynamics of income over time by channel are positive. It is best from the Organic channel.
  • The time distribution of ARPPU is growing. The following channels can be designated as outsiders: other_banner, g_ads_s.

3.2.3 | Revenue and ARPPU over time by top Сountries

  • In terms of income, growth dynamics are more pronounced in the following countries: USA, UK, Germany.
  • According to ARPPU, the dynamics of the distribution over time shows that in some months there is a slight drop. By collapsing the chart weekly, you can monitor the dynamics for individual countries.

HOME || Part III

3.3 | LTV & ROI & CAC over time

LTV (Customer Lifetime Value) is a marketing metric that measures the total value a customer can bring to a business over the length of their relationship with the company. This can help businesses prioritize their marketing efforts and understand the potential long-term impact of individual customers. LTV is typically calculated by taking a customer's past spending, and extrapolating the expected future spending based on their behavior. It is used by businesses across various industries to gain a competitive edge, make informed decisions, and improve their overall profitability.

ROI (Return On Investment) is the profit or return generated by an investment. It is calculated by subtracting the initial investment cost from the total revenue generated, and dividing the result by the initial investment cost. A positive ROI indicates that the investment was profitable, while a negative ROI indicates that the investment was not profitable.

CAC stands for Cost Average Conversion. It is a marketing metric used to calculate the cost of acquiring a new customer and converting them into a paying customer. It involves dividing the total cost of acquiring customers by the number of customers acquired. CAC is an essential metric for businesses to track and manage their customer acquisition costs and make

Calculate LTV and ROI for given customers and orders data

Plot LTV and ROI

Plot Customer Acquisition Cost (CAC)

3.3.1 | LTV & ROI & CAC over time - Total

Сonclusions:

  1. LTV (Customer Lifetime Value): This indicator shows how much money one customer brings in over the entire period of cooperation with the company. The graphs show that LTV increases slightly over time. This means that customers continue to generate income after the first purchase. The peak is reached in December, and spikes are also visible in other months of the year.
  2. ROI (Return On Investment): Indicator of advertising effectiveness and return on advertising costs. If the ROI is greater than 1, then advertising is profitable; if it is less than 1, it is unprofitable. The graphs show that the ROI for all advertising channels is many times greater than 1, which indicates their effectiveness.
  3. CAC (Customer Acquisition Cost): This indicator shows how much it costs to attract one paying customer. The graph shows that CAC varies across time periods.

3.3.2 | LTV & ROI & CAC over time - by Acquisition channels

Сonclusions:

  1. The graphs show that LTV is growing over time for all advertising channels. This means that customers continue to generate income after the first purchase.
  2. The graphs show that the ROI for all advertising channels is many times greater than 1, which indicates their effectiveness. For the g_ads_s channel, the ROAS is the least, which indicates its lowest efficiency compared to others.
  3. The graph shows that CAC varies depending on the advertising channel. The g_ads_s channel has the highest CAC, which, combined with low ROI, indicates its lowest efficiency.
  4. The best results are shown by g_ads_b (Google Ads Media), the ROI is noticeably higher, and the CAC is minimal than the others, with the highest LTV. However, there are noticeable gaps in ad impressions. What is this connected with?
  5. You can notice a peak in indicators for fb_ads in the August-September period. You can also notice the stability of this attraction channel.

In general, based on these data, we can conclude that the advertising channels fb_ads, g_ads_b and other_banner are the most effective, while the g_ads_s channel requires additional analysis and possible optimization.

3.3.3 | LTV & ROI & CAC over time - Top Сountries

  1. The graphs show that LTV for the TOP10 countries by total revenue falls in January-February and increases in November-December. This means that there is an obvious decline in purchases after the new year and an increase before it.
  2. The graphs show that the ROI for the TOP10 countries in terms of total revenue is many times higher than 1, which indicates their effectiveness.
  3. The highest LTV and ROI indicators are observed for Lithuania, Australia, and the Netherlands, which indicates the high efficiency of attracting customers in this country.

HOME || Part III

3.4 | Retention & LTV Cohort Analysis

Customer retention is a key metric that reflects a company's ability to retain customers over a certain period of time. It's an important aspect of business, as attracting new customers is typically more costly than retaining existing ones.

Customer retention can be calculated using the following formula:

Retention = ((CE - CN) / CN) * 100

where:

For example, if you had 100 customers at the beginning of the month and 90 customers at the end of the month, your retention rate would be 90%.

Customer retention is important for several reasons. Firstly, it helps increase profitability, as retaining existing customers costs less than acquiring new ones. Secondly, customer retention can contribute to business growth, as satisfied customers often become brand ambassadors and attract new customers. Finally, a high level of customer retention can serve as an indicator of the quality of a company's products or services.

However, it's important to remember that customer retention is not the only metric to consider. For a complete picture, it's also important to consider other metrics such as profit per customer, customer acquisition cost, and customer satisfaction level.

LTV (Lifetime Value) is a metric that shows the average amount of money a single customer brings over the entire period of their activity. It's an important metric that helps understand how much you can afford to spend on acquiring new customers and retaining existing ones.

The calculation of LTV using the cohort method involves the following steps:

The simple formula for LTV calculation is:

LTV = (Total Revenue - Total Cost) / Total Number of Customers

LTV is calculated as the difference between total revenue and total cost, divided by the number of customers in the cohort. This gives the average LTV value for each customer in the cohort.

So, the LTV calculated using the cohort method shows the average amount of profit each customer from a certain cohort brings over the entire period of their activity. This helps understand how customer profitability changes over time and how to effectively manage costs for customer acquisition and retention.

Explanations of the graphs:

3.4.1 | Retention & LTV Cohort Analysis - by Total

Visually, the retention dynamics for all data shows general trends:

  1. Seasonality. We see repeating patterns in retention dynamics. This may indicate seasonality in your users' behavior. Retention is lowest in April and gradually increases reaching highs at the end of the year.
  2. Significant changes in retention are noticeable for customers who came in January 2022. They have a high retention rate.
  3. The highest retention is observed for January customers in November and December - 41 and 43%. Why is that? This could be due to changes in your product, marketing campaigns, or external environment for January customers.

From the ltv matrix data, we can draw the following conclusions:

  1. The maximum Lifetime Value (LTV) is observed for customers acquired in December 2022, amounting to 747.41. This indicates a high profitability of these customers for the company.
  2. The minimum LTV is observed for customers acquired in June 2022, in the second period (July), and is 39.66. This may indicate that these customers bring less revenue to the company.
  3. The average LTV of customers acquired in January 2022 steadily increases over time, reaching a peak in December 2022 with a value of 522.05. This indicates that these customers continue to bring revenue to the company throughout the year.
  4. For customers acquired in subsequent months, a similar trend is observed, although the overall LTV is slightly lower. For example, customers acquired in February 2022 reach a peak LTV in the tenth period with a value of 302.01.

3.4.2 | Retention & LTV Cohort Analysis - by Acquisition Channels

Retention in the context of customer acquisition channels by monthly cohorts has the following trends:

  1. For customers of the January cohort, periods 10 (November) and 11 (December) have the maximum retention values. The leaders are other_banner (73, 68%), organic (52, 72%), g_ads_s (62, 61%). What is the reason for customer returns at the end of the year?
  2. But if we take periods 2 and 3, then on the contrary, low retention is observed in organic and other_banner. In some areas it drops to 7%. What is causing this instability?

From the ltv matrix data by channel, we can draw the following conclusions:

  1. The maximum Lifetime Value (LTV) is observed for customers acquired through g_ads_s in December 2022, amounting to 1322.35. This indicates a high profitability of these customers for the company.
  2. The minimum LTV is observed for customers acquired through g_ads_s in November 2022, in the second period, and is 84.67. This may indicate that these customers bring less revenue to the company.
  3. The average LTV of customers acquired through fb_ads in January 2022 steadily increases over time, reaching a peak in the 11th period with a value of 348.81. This indicates that these customers continue to bring revenue to the company throughout the year.
  4. For customers acquired in subsequent months, a similar trend is observed, although the overall LTV is slightly lower. For example, customers acquired through fb_ads in February 2022 reach a peak LTV in the seventh period with a value of 456.63.

3.4.3 | Retention & LTV Cohort Analysis - by Top Countries

There are many countries. Therefore, let’s look at customer Retention & LTV according to the list of TOP5 countries by revenue and TOP5 arppu.

Retention in the context of country by monthly cohorts has the following trends:

  • Customer retention for countries with large cohort sizes is relatively stable. Only the January cohort graduates.
  • For small cohorts, the reduction value jumps. This is natural, since in some month the clients did not return, and in another month all of them may return.

From LTV, we can draw the following conclusions:

  • The maximum Lifetime Value (LTV) is observed for customers from the Netherlands acquired in January 2022, amounting to 5117.45. This indicates a high profitability of these customers for the company.
  • The minimum LTV is observed for customers from Canada acquired in August 2022, amounting to 16.28. This may indicate that these customers bring less revenue to the company.
  • The average LTV of customers from Australia acquired in January 2022 steadily increases over time, peaking in the 9th period with a value of 1449.96. This indicates that these customers continue to bring revenue to the company throughout the year.
  • For customers acquired in subsequent months, a similar trend is observed, although the overall LTV is slightly lower. For example, customers from Australia acquired in February 2022 reach a peak LTV in the 9th period with a value of 914.07.

Maximum peak values LTV for each country:

  1. Australia: is observed in December 2022 with a value of 3828.93.
  2. Canada: is observed in October 2022 with a value of 2030.30.
  3. EIRE: is observed in January 2022 with a value of 1992.10.
  4. Germany: is observed in February 2022 with a value of 1395.79.
  5. Netherlands: is observed in January 2022 with a value of 5117.44.
  6. Singapore: is observed in March 2022 with a value of 4566.99.
  7. Switzerland: is observed in April 2022 with a value of 1541.19.
  8. USA: is observed in June 2022 with a value of 562.98.
  9. United Kingdom: is observed in December 2022 with a value of 789.70.

3.4.4 | Average customer retention for all periods. Average customer lifetime.

The average customer retention by acquisition channels showed:

  • Organic customers have the highest retention (22.4%), which indicates the stability of the acquisition channel.
  • Channel g_ads_s has the lowest average retention (11.5%). But at the same time, in November-December this channel was one of the leaders.
  • The customer’s lifetime was distributed according to retention and ranged from 35 to 56 days.
  • Average customer retention by country only objectively reflects the reality of countries with a sufficient number of customers. Because, with a small number of clients, in some periods the deduction will be zero.
  • Longest customer lifetime in the Czech Republic (133). And the smallest is in Malta (61 days).

HOME || Part III

Part IV - RFM Analysis

Table of Contents:

RFM Analysis

In a business context: A clustering algorithm is a method that helps to segment customers, that is, the process of classifying similar customers into the same segment. The clustering algorithm helps to better understand customers from the point of view of both static demographics and dynamic behavior. Customers with comparable characteristics often interact with the business in the same way, so the business can benefit from this methodology by creating an individual marketing strategy for each segment. Based on this, customers can be offered discounts, offers, promo codes, etc. As a simple example. The company wants to offer its customers discounts. Currently, they are studying the details of each customer and based on this information decide what offer to make to a particular customer. The company could potentially have thousands of customers. Does it make sense to look at the details of each customer separately and then make a decision? Of course not! This is a manual process and will take a huge amount of time. So, what can the company's marketing department do? One option is to divide customers into different groups. For example, the company can group customers by RFM analysis.

The abbreviation RFM stands for:

Recency - recency (how long ago your customers were at your procedures). A high recency score means that the customer has already formed a good impression of your brand, so he has recently visited you. Recency in the customer base can be viewed if you sort customers by the date of the last purchase.

Frequency - frequency (how often they buy from you). A high frequency score indicates that the customer likes your brand, your products and services, so he often returns to you. To calculate the frequency of visits, you need to divide the total number of purchases/visits by the number of months/days/years, etc.

Monetary - money (total spending). A high level of this indicator means that the customer likes to spend exactly with you.

Description:

4.1 | Obtaining and analyzing RFM features

There is a problem with Frequency. Quartiles: {0.2: 1.0, 0.4: 2.0, 0.6: 2.0, 0.8: 4.0}. With this division, the majority of customers have 2 purchases. Therefore, we take this into account when segmenting.


HOME || Part IV

4.2 | Logical Segmentation. RFM Scores

Conclusion on logical segmentation:

  1. We tried to take into account all three features. However, the Recency trait has the greatest influence (inverse correlation) (-0.88). Next are Frequency (0.37) and Monetary (0.17).
  2. Segmented into 10 groups. The largest group was the hibernating (30 %). And the smallest number of simple loyal_customers (2 %), because rich_loyal_customers are allocated to a separate segment (8%).

HOME || Part IV

4.3 | K-Means Clustering

K-Means Segmentation is a type of unsupervised machine learning algorithm used to divide data into distinct groups or clusters based on similarity in features. It's commonly used in customer segmentation to group customers with similar behaviors or characteristics.

Using the elbow method, we will try to find a successful division into an acceptable number of clusters. We aim for 10 clusters to comply with manual division.

The Silhouette Score method is a measure used to determine the optimal number of clusters in a dataset. It calculates the average distance between each sample in a cluster and all other points in the next nearest cluster. The score ranges from -1 to 1, where a high value indicates that the sample is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. The optimal number of clusters is typically determined by the highest Silhouette Score.

The Silhouette Index plot shows that the best separation is achieved with 4-6 clusters. However, the number of clusters is insufficient for separating customers and they are disproportionate.

Therefore, it was decided to segment the customers into 10 groups

  • When clustering our data with unsupervised machine learning using K-Means, a significant influence of Recency is noticed. However, the other two features are not sufficiently taken into account. This is due to the clustered arrangement of data by Recency, which is obvious, as these are dates. There are large outliers for the Frequency and Monetary features

Let's compare customer segmentation on a 3D graph

  • As previously concluded, we cannot focus solely on Recency. We need to take into account the other two features. Logical segmentation is better suited for this.

HOME || Part IV

4.4 | DBSCAN Clustering

Finally, let's try to perform clustering using the DBSCAN method of machine learning, which is based on a different principle. This method handles data outliers better.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm used for data analysis and pattern recognition. It groups data points based on their density, identifying clusters of high-density regions and classifying outliers as noise

Let's run the models DBSCAN to find suitable hyperparameters

DBSCAN clustering

The DBSCAN algorithm did not accomplish our task. We were only able to select one pair of hyperparameters for dividing into 10 clusters. As a result, we got one cluster of a huge size and the rest were small. The DBSCAN clustering method did not suit us.


HOME || Part IV

4.5 | Evaluating the RFM Model and Segments

Let's evaluate our clustering models with special indexes. This is interesting. However, we will not rely on the index indicators, as we have settled on manual logical segmentation of customers.

We will use the following indexes:

  1. Silhouette Score, which can take a value from -1 to +1. The closer to 1, the better. A good indicator is considered to be above 0.6.

  2. Calinski Harabasz Score. The larger the value, the better when comparing models.

  3. Davies Bouldin Score. The smaller the value, the better in evaluating clustering models.

Based on the results of evaluating clustering models using indexes, we can draw the following conclusions:

  • DBSCAN has the lowest indicators, which is objective.
  • The K-means algorithm showed the best clustering results.
  • Logical segmentation has intermediate indicators, but it is the most suitable for work. Moreover, we do not need to interpret the value of clusters obtained during machine division.

Let's output the average values and ranges of our segments

Conclusions on evaluating logical customer segmentation:

  1. The "hibernating" segment contains the largest number of customers (1584), but the average Recency in this segment is the highest (203 days), indicating that customers from this segment have not made purchases for a long time. The average Monetary in this segment is 434.55, and the average Frequency is 1.26.
  2. The "need_attention" segment contains 850 customers with an average Recency of 57 days. This indicates that customers from this segment have made purchases relatively recently. The average Monetary in this segment is 583.04, and the average Frequency is 2.17.
  3. The "new_customers" segment contains 669 customers with the lowest average Recency (22 days), indicating that these customers made their last purchase very recently. The average Monetary in this segment is 306.33, and the average Frequency is 1.48.
  4. The "potential_loyalists" segment contains 522 customers with an average Recency of 20 days. This indicates that customers from this segment have made purchases relatively recently. The average Monetary in this segment is 928.86, and the average Frequency is 3.78.
  5. The "rich_loyal_customers" segment contains 433 customers with an average Recency of 15 days. This indicates that customers from this segment have made purchases very recently. The average Monetary in this segment is 8608.24, which is the highest among all segments, and the average Frequency is 14.80.
  6. The "can't_loose" segment includes 352 customers with an average Recency of 108 days. This indicates that customers from this segment have not made purchases for some time. The average Monetary in this segment is 4048.84, which is quite high, and the average Frequency is 3.79.
  7. The "at_risk" segment contains 247 customers with a high average Recency of 141 days. This indicates that customers from this segment have not made purchases for a long time. The average Monetary in this segment is 840.32, and the average Frequency is 3.54.
  8. The "rich_potential_loyalists" segment includes 158 customers with a low average Recency of 19 days, indicating that these customers made their last purchase very recently. The average Monetary in this segment is 3997.76, which is a high value, and the average Frequency is 3.91.
  9. The "rich_new_customers" segment contains 153 customers with a low average Recency of 18 days. This indicates that these customers made their last purchase very recently. The average Monetary in this segment is 3087.39, and the average Frequency is 1.73.
  10. The "loyal_customers" segment includes 149 customers with the lowest average Recency of 17 days, indicating that these customers made their last purchase very recently. The average Monetary in this segment is 1337.25, and the average Frequency is 7.51, which is one of the highest values among all segments.

Overall, these data can be useful for developing customer relationship management strategies and marketing campaigns.


HOME || Part IV

Part V - Predicting profitability indicators

Table of Contents:

BG/NBD

Beta Geometric / Negative Binomial Distribution (BG/NBD) models are predicated on the idea that each customer's transaction count adheres to a Poisson process, with a gamma distribution representing the variability in transaction rates among customers. These assumptions enable us to employ the Negative Binomial Distribution (NBD) to model the transaction count made by a customer during their "lifetime". The BG/NBD model can be constructed using the BetaGeoFitter and lifetimes packages.


HOME || Part V

5.1 | BG/NBD Model Calculation: Number of Transaction

Top 10 Expected Number of Transaction (1 Month)

Frequency of Repeat Transactions


HOME || Part V

5.2 | Gamma-Gamma Model Calculation: LTV

The Gamma-Gamma model is a statistical approach used in calculating the monetary value of transactions in customer lifetime value (CLTV) predictions. It is often used in conjunction with the BG/NBD model (Beta-Geometric/Negative Binomial Distribution model) for predicting future transactions.

The Gamma-Gamma model assumes that there is no relationship between the monetary value and the purchase frequency. In other words, the amount a customer spends on a purchase is independent of how often they make a purchase.

The model is called "Gamma-Gamma" because it assumes that the transaction value of customers follows a Gamma distribution, and the average transaction value also follows a Gamma distribution.

The model provides an estimate of the average transaction value for each customer, which can be used to predict the customer's lifetime value by multiplying it by the predicted number of future transactions from the BG/NBD model.

1 and 12 months CLTV Prediction

Make a 1-, 12-month CLTV prediction for customers.

Interpret and evaluate the results you have obtained.

To display the distribution of predicted LTV and ARPPU for the upcoming year compared to 2022

We can draw the following conclusions:

  1. The projected total revenue for the upcoming year is 10048406, which is 34.7% higher than the total revenue in 2022, which was 7454080. This could indicate expected business growth or increased customer activity.
  2. The average ARPPU (Average Revenue Per Paying User) in 2022 was $462, while the projected ARPPU for the upcoming year is $501. This means an expected increase in the average revenue per paying user by 8.4%.

These forecasts can be useful for budget planning, marketing campaigns, and customer relationship management strategies. However, it's important to remember that these are predictions, and actual results may vary depending on many factors.


HOME || Part V

5.3 | Additional customer segmentation based on their projected income.

Let's show the values between 0 and 1

Let's combine data and create aggregated segments

Additionally, we will mark VIP customers. In this case, these will be customers who spent more than $40,000 in 2022.

We will display the percentage ratio by segments. What is the share of each cluster we have identified?


HOME || Part V

Conclusions:

Analysis of cancelled or returned orders:

Profitability analysis:

Analysis of LTV & ROI & CAC over time:

Analysis of Retention & LTV by cohort analysis:

Customer segmentation:

Prediction of CLTV:

More detailed conclusions are placed in the corresponding sections. Summary tables have been obtained for orders, customer profiles, customer segmentation, predicted data on revenues for the next year. These data can be taken to form dashboards.

The conclusions and predictions made can be useful for budget planning, development of marketing campaigns and customer relationship management strategies.


HOME || auro2002@tut.by